- Hello everyone, welcome to CS231n. I'm Song Han. Today I'm going to give a guest lecture on efficient methods and hardware for deep learning. I'm a fifth-year PhD candidate here at Stanford, advised by Professor Bill Dally.

So in this course we have seen a lot of convolutional neural networks, recurrent neural networks, and even, since last time, reinforcement learning. They span a lot of applications: for example, self-driving cars, machine translation, AlphaGo, and smart robots. They are changing our lives, but there is a recent trend: in order to achieve such high accuracy, the models are getting larger and larger. For example, for ImageNet recognition, from the 2012 winner to the 2015 winner, the model size increased by 16x. And for Baidu's Deep Speech, in just one year the number of training operations increased by 10x.

Such large models create lots of problems. First, large models are difficult to deploy, for example on mobile phones: if an app is larger than 100 megabytes, you cannot download it until you connect to Wi-Fi. So product managers at companies like Baidu and Facebook are very sensitive to the binary size of their models. Likewise for self-driving cars, where you can only ship model updates over the air; if the model is too large, that also becomes difficult.

The second challenge for these large models is that training speed is extremely slow. For example, ResNet-152, which is actually less than 1% more accurate than ResNet-101, takes 1.5 weeks to train on four Maxwell M40 GPUs. That greatly slows down both our homework and researchers designing new models.

And the third challenge for these bulky models is energy efficiency. For example, AlphaGo beating Lee Sedol last year took 2,000 CPUs and 300 GPUs, costing $3,000 just to pay the electric bill, which is insane.
So on embedded devices these models drain your battery, and in the data center they increase the total cost of ownership of maintaining a large data center. For example, Google mentioned in their blog that if every user used Google Voice Search for just three minutes, they would have to double their data centers. That's a large cost, so reducing it is very important.

Let's see where the energy is actually consumed. A large model means lots of memory access: you have to load the model from memory, which means more energy. If you compare how much energy is consumed by loading from memory versus how much is consumed by the arithmetic operations, the multiplies and adds, memory access is two to three orders of magnitude more energy-consuming than arithmetic.

So how do we make deep learning more efficient? We have to improve energy efficiency through algorithm and hardware co-design. The previous way of designing hardware was: take some benchmarks, say SPEC 2006, run them, and tune your CPU architecture for those benchmarks. Now what we should do is open up the box, see what we can do from the algorithm side first, and ask what the optimal "?PU", some yet-to-be-named processing unit, should look like. That breaks the boundary between algorithm and hardware to improve the overall efficiency.

So today's talk has the following agenda. We are going to cover four aspects, algorithm and hardware crossed with inference and training, which form a small two-by-two matrix: algorithms for efficient inference, hardware for efficient inference, algorithms for efficient training, and lastly hardware for efficient training. For example, I'm going to cover the TPU, and I'm going to cover Volta.

But before that, let's have three slides of Hardware 101: a brief introduction to the families of hardware, arranged as a tree. In general, there are roughly two branches.
One is general-purpose hardware, which can run any application, versus specialized hardware, which is tuned for a specific kind of application, a domain of applications. General-purpose hardware includes the CPU and the GPU. The difference is that the CPU is latency-oriented and single-threaded, like a big elephant, while the GPU is throughput-oriented: its threads are small and weak, but there are thousands of these small cores, like a colony of ants. For specialized hardware, roughly there are FPGAs and ASICs. FPGA stands for Field Programmable Gate Array: it is hardware-programmable, so its logic can be changed. That makes it cheaper to try new ideas and prototype, but less efficient; it sits in the middle between general-purpose hardware and a pure ASIC. ASIC stands for Application-Specific Integrated Circuit: it has fixed logic, designed for one application, for example deep learning. Google's TPU is a kind of ASIC, and the GPUs we trained neural networks on earlier also sit on this branch.

Another slide of Hardware 101 is on number representations. The idea I want to convey is that numbers in a computer are not real numbers; they are actually discrete. Even 32-bit floating-point numbers do not have perfect resolution: they are not continuous but discrete. For example, FP32 means using 32 bits to represent a floating-point number. There are three components in the representation, the sign bit S, the exponent bits E, and the mantissa bits M, and the number it represents is (-1)^S × 1.M × 2^(E-127), where 127 is the standard exponent bias. Similarly there is FP16, which uses 16 bits to represent a floating-point number.
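To make that discreteness concrete, here is a small Python sketch (standard library only, not from the lecture) that unpacks the bit fields of an FP32 number and reassembles the value as (-1)^S × 1.M × 2^(E-127); the bias of 127 comes from the IEEE 754 standard.

```python
import struct

def decode_fp32(x: float) -> tuple[int, int, float]:
    """Unpack an IEEE 754 single-precision float into its three fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # the raw 32-bit pattern
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 mantissa bits
    # Reassemble (-1)^S * 1.M * 2^(E-127); valid for normalized numbers.
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent - 127, value

print(decode_fp32(3.14159))  # the reassembled value equals the stored float
print(decode_fp32(0.1))      # prints 0.10000000149...: 0.1 is not exactly representable
```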
In particular, I'm going to introduce Int8, which the TPU uses at its core: an integer representing a fixed-point number. We have a certain number of bits for the integer part, followed by a radix point, which we can place differently for different layers, and lastly the fractional bits.

So why do we prefer 8 bits or 16 bits over the traditional 32-bit floating point? The cost. I generated this figure from 45-nanometer technology data, showing the energy cost and the area cost of different operations. In particular, going from 32 bits to 16 bits gives about a 4x reduction in energy and also about a 4x reduction in area. Area means money: every square millimeter costs money when you tape out a chip. So going from 32 bits to 16 bits is very beneficial for hardware design. That's why NVIDIA, starting with the Pascal architecture, announced support for FP16. As an illustration of what a 4x reduction in energy cost means: a battery that previously lasted four hours now lasts sixteen.

But there is still the problem of the large energy cost of reading memory. Memory references are that expensive, so how do we deal with this problem better? Let's switch gears and come to the topic directly: algorithms for efficient inference. I'm going to cover six topics, and this is a really long part, so I'm going to go relatively fast.

The first idea is pruning: pruning neural networks. This is the original network, and the question is: can we remove some of the weights and still keep the same accuracy? It's like pruning a tree, getting rid of the redundant connections. This was first proposed by Professor Yann LeCun back in 1989, and I revisited the problem 26 years later on modern deep neural nets to see how it works. Not all parameters are useful, actually.
For example, in this case, if you want to fit a straight line but you include a quadratic term, then the 0.01 coefficient is clearly a redundant parameter. So I train the connectivity first, then prune away some of the connections, then retrain the remaining weights, and iterate. As a result, I can reduce the number of connections in AlexNet from 60 million parameters to only 6 million, which is 10 times less computation.

So this is the accuracy plot: the x-axis is how many parameters we prune away, and the y-axis is the accuracy. We want fewer parameters but the same accuracy as before; we don't want to sacrifice accuracy. For example, at 80%, if we just zero away 80% of the parameters, the accuracy drops by about 4%. That's intolerable. But the good thing is that if we retrain the remaining weights, the accuracy fully recovers. And if we do this process iteratively, pruning and retraining, pruning and retraining, the accuracy doesn't begin to drop until we have pruned away 90% of the parameters. So if you go home and try it, say in an iPython notebook on your homework model, and just zero away 50% of the parameters, you will be astonished to find that the accuracy doesn't suffer at all.

We just covered convolutional neural nets; how about RNNs and LSTMs? I tried this with NeuralTalk. Again, pruning away 90% of the weights doesn't hurt the BLEU score. Here are some visualizations. On the original picture, NeuralTalk says "a basketball player in a white uniform is playing with a ball." After pruning away 90% of the weights, it says "a basketball player in a white uniform is playing with a basketball," and so on. But if you're too aggressive, say pruning away 95% of the weights, the network gets drunk: it says "a man in a red shirt and white and black shirt is running through a field."
So there is really a limit, a threshold, that you have to take care of during pruning.

Interestingly, after I did this work, I did some research and found that the same pruning procedure actually happens in the human brain as well. When we are born, there are about 50 trillion synapses in the brain. By one year old, this number has surged to 1,000 trillion. And as we become adolescents it actually becomes smaller, about 500 trillion in the end, according to a study published in Nature. So this is very interesting.

Pruning also changes the weight distribution, because we remove the small connections around zero; after we retrain, the remaining weights spread out, which is why the distribution looks softer in the end.

Yeah, question.
- [Student] Do you mean that the weights pruned during training are just set to zero, and training then continues with those weights starting from zero?
- Yeah, so the question is: how do we deal with those zeroed connections? We force them to stay at zero in all the later iterations.
Question?
- [Student] How do you pick which weights to drop?
- Yeah, so it's very simple: sort the weights, and if a weight is small, drop it.
- [Student] With a threshold that you decide?
- Exactly, yeah.
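Here is a minimal NumPy sketch of the iterative magnitude pruning just described; the gradient function, learning rate, and step counts are stand-ins for a real training setup, not from the lecture.

```python
import numpy as np

def prune_by_magnitude(w, fraction):
    """Return a 0/1 mask that zeroes out the `fraction` smallest-magnitude weights."""
    threshold = np.quantile(np.abs(w), fraction)  # sort magnitudes, pick the cutoff
    return (np.abs(w) > threshold).astype(w.dtype)

def fake_gradient(w):
    """Hypothetical stand-in for a real backprop gradient."""
    return 0.01 * w

w = np.random.randn(256, 256)
for fraction in (0.5, 0.7, 0.9):       # prune, retrain, prune more, retrain ...
    mask = prune_by_magnitude(w, fraction)
    w *= mask                           # drop the small weights
    for _ in range(100):                # "retrain" the surviving weights
        w -= 0.1 * fake_gradient(w)
        w *= mask                       # pruned weights stay pinned at zero (per the Q&A)
    print(f"kept {mask.mean():.0%} of the weights")
```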
So, the next idea: weight sharing. Remember, our end goal is to reduce the memory footprint so that we get more energy-efficient deployment. Pruning gave us fewer parameters; now we want fewer bits per parameter, so that, multiplied together, they give a small model.

The idea is this: not every weight has to be an exact number. For example, 2.09 and 2.12, all four of these weights, you can just represent with 2.0. That's enough; an overly precise number just leads to overfitting. So I cluster the weights, and if they are similar, I use a centroid to represent them instead of the full-precision values, so that at inference time I work with this single number. For example, this is a four-by-four weight matrix in a certain layer. I run k-means clustering so that similar weights share the same centroid: for 2.09 and 2.12, I store an index of 3 pointing to that centroid. The good thing is that we then only need to store the 2-bit index rather than the 32-bit floating-point number. That's a 16x saving.

And how do we train such a neural network, where the weights are bound together? After we get the gradients, we color them in the same pattern as the weights, do a group-by operation so that all gradients whose weights share the same index are grouped together, do a reduction by summing them up, multiply by the learning rate, and subtract the result from the original centroid. That's one iteration of SGD for such a weight-shared neural network.
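Here is a small NumPy sketch of that procedure: a tiny 1-D k-means over the weights, 2-bit indices into four centroids, and one group-by/reduce SGD step on the shared centroids. The gradient is a stand-in, and a real implementation would do this per layer.

```python
import numpy as np

def kmeans_1d(w, k, iters=20):
    """Tiny 1-D k-means over weight values."""
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    return centroids, idx

w = np.random.randn(16)                    # a 4x4 weight matrix, flattened
centroids, idx = kmeans_1d(w, k=4)         # 4 centroids -> a 2-bit index per weight

# One SGD step on the shared weights: group gradients by index, reduce, update.
grad = 0.01 * w                            # hypothetical backprop gradient
lr = 0.1
for j in range(len(centroids)):
    centroids[j] -= lr * grad[idx == j].sum()   # sum (reduce) within each group

w_shared = centroids[idx]                  # inference needs only index + codebook
```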
So remember what the weight distribution looked like after pruning; after weight sharing, the weights become discrete. There are only 16 different values here, meaning we can use four bits to represent each number. And by training on such a weight-shared neural network, on this extremely shared neural network, the weights can still adjust; those subtle changes compensate for the loss of accuracy.

So let's see: this axis is the number of bits we allow, and this is the accuracy. For the convolutional layers, not until four bits does the accuracy begin to drop; and for the fully connected layers, very astonishingly, not until two bits, only four distinct values, does the accuracy begin to drop. This result is per layer.

So we have covered two methods, pruning and weight sharing. What if we combine the two; do they still work well? Combining them, this axis is the compression ratio, smaller on the left, and this is the accuracy. Together they can bring the model to about 3% of its original size without hurting the accuracy at all, whereas with each method working individually, the accuracy begins to drop at around 10% of the original size. And compared with the SVD method, which is cheap, this gives a much better compression ratio.

The final idea is that we can apply Huffman coding: use more bits for the infrequently appearing weights and fewer bits for the frequently appearing weights. By combining these three methods, pruning, weight sharing, and Huffman coding, we can compress state-of-the-art neural networks by 10x to 49x without hurting the prediction accuracy. Sometimes it's even a little better, but maybe that is noise.
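For concreteness, here is a compact Python sketch (standard library only, my own illustration) of Huffman coding over quantized weight indices; the frequencies are made up, and a deployed coder would also need the decoding side.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build Huffman codes: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)                      # tiebreaker so dicts are never compared
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Quantized weight indices: index 0 is far more common than the rest.
indices = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
print(huffman_codes(indices))  # index 0 gets a 1-bit code, the rare ones longer
```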
So the next question is: these are just pre-trained models from, say, Google or Microsoft. Can we have a compact model to begin with, even before such compression?

SqueezeNet: you may already have worked with this neural network model in a homework. The idea is that a squeeze layer here shrinks the number of channels feeding the 3x3 convolutions; that's where "squeeze" comes from. And there are two branches here, rather than four branches as in the Inception model. As a result, the model is extremely compact. It doesn't have any fully connected layers; everything is convolutional, and the last layer is a global pooling.

So what if we apply the deep compression algorithm to such an already compact model: will it get even smaller? This is AlexNet after compression, and this is SqueezeNet. Even before compression it's 50x smaller than AlexNet with the same accuracy; after compression it is 510x smaller, still with the same accuracy, at less than half a megabyte. This means it's very easy to fit such a small model in on-chip SRAM, which is literally tens of megabytes.

So what does that mean? It's possible to achieve a speedup. This is the speedup I measured, on the fully connected layers only for now, on a CPU, a GPU, and a mobile GPU, before and after pruning the weights. On average I observed about a 3x speedup on the CPU, about 3x on the GPU, and roughly 5x on the mobile GPU, a TK1. And the same goes for energy efficiency: an average improvement of 3x to 6x on the CPU, GPU, and mobile GPU. These ideas are used at these companies.

Having talked about pruning and weight sharing, which is a non-linear quantization method, we're now going to talk about quantization, which is what the TPU design uses. The TPU uses only eight bits for inference, and the reason it can is quantization. Let's see how it works. Quantization comes with this complicated figure, but the intuition is very simple. You train the neural network with normal floating-point numbers. Then you quantize the weights and activations by gathering statistics for each layer: for example, the maximum number, the minimum number, and how many bits are enough to represent that dynamic range. Then you use that many bits for the integer part, and the rest of the eight bits for the fractional part of the representation. You can also fine-tune in floating-point format, or do the feed-forward pass in fixed point and the back-propagation update in floating point. There are lots of different ideas for getting better accuracy.
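Here is a small NumPy sketch of that per-layer calibration; the power-of-two scaling and the saturation bounds are my illustration of the fixed-point scheme described above, not code from the talk.

```python
import numpy as np

def calibrate_frac_bits(samples, num_bits=8):
    """Per-layer statistics: how many of the 8 bits can be fractional?"""
    max_abs = np.abs(samples).max()                     # observed dynamic range
    int_bits = int(np.ceil(np.log2(max_abs + 1e-12)))   # bits for the integer part
    return (num_bits - 1) - int_bits                    # the rest, minus the sign bit

def quantize(x, frac_bits, num_bits=8):
    """Round to fixed point with `frac_bits` fractional bits, saturating to int8."""
    scaled = np.round(x * 2.0 ** frac_bits)
    lo, hi = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(scaled, lo, hi).astype(np.int8)

acts = np.random.randn(1000) * 3.0            # pretend activations of one layer
fb = calibrate_frac_bits(acts)
q = quantize(acts, fb)
restored = q.astype(np.float32) / 2.0 ** fb   # dequantize for comparison
print("frac bits:", fb, "max error:", np.abs(acts - restored).max())
```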
And this is the result, the number of bits versus the accuracy. Using fixed-point 8 bit, the accuracy for GoogLeNet doesn't drop significantly, and for VGG-16 the accuracy also holds up pretty well, while going down to six bits, the accuracy begins to drop pretty dramatically.

Next idea: low-rank approximation. It turns out that you can break one convolutional layer into two: one convolution here, followed by a one-by-one convolution. It's like breaking one complicated problem into two smaller ones. For convolutional layers this achieves about a 2x speedup with almost no loss of accuracy, and about a 5x speedup with roughly a 6% loss of accuracy. The same works for fully connected layers: the simplest idea is to use the SVD to break one matrix into two matrices. Following this idea, this paper proposes the Tensor Train decomposition, breaking one fully connected layer down into a long chain of small factors, which is where the name comes from.
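Here is a minimal NumPy sketch of the SVD idea for a fully connected layer; the layer sizes and the rank are made up for illustration, and a real trained weight matrix is far closer to low-rank than this random one, so the error shown here is pessimistic.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Split one fully connected layer W into two thin layers via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # first factor:  (out, rank)
    B = Vt[:rank, :]               # second factor: (rank, in)
    return A, B

W = np.random.randn(1024, 4096)    # original layer: ~4.2M weights
A, B = low_rank_factorize(W, rank=128)
# The two factors hold 1024*128 + 128*4096 ~= 0.66M weights, ~6x fewer,
# and W @ x is approximated by two cheaper matmuls: A @ (B @ x).
x = np.random.randn(4096)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank 128: {err:.2f}")
```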
Going even more crazy: can we use only two or three distinct values to represent a neural network, binary or ternary weights? We have already seen the weight distribution after pruning: some positive weights and some negative weights. Can we use just three numbers, one, minus one, and zero, to represent the network? This is our recent paper, where we maintain full-precision weights during training time, but at inference time we keep only the scaling factors and the ternary weights. So during inference we need only three values, which is very efficient and makes the model very small. This plot shows the proportions of the positive, zero, and negative weights: they can change during training, and so can their absolute values, the scaling factors.

And this is the visualization of kernels learned by this trained ternary quantization. We can see that some of them are corner detectors, like this one here and this one, and some of them are maybe edge detectors, like this filter. Actually, we don't need such fine-grained resolution; just three values are enough.

This is the validation accuracy on ImageNet with AlexNet. The dashed line is the baseline accuracy with 32-bit floating point, and the red line is our result: it converges to pretty much the same accuracy as the full-precision weights.
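Below is a NumPy sketch of threshold-based ternarization in the spirit of that paper; the threshold heuristic and the per-sign scaling factors set from mean magnitudes are my simplifications (the paper learns the scaling factors by back-propagation).

```python
import numpy as np

def ternarize(w, t):
    """Quantize full-precision weights to {-Wn, 0, +Wp} for inference."""
    mask_p, mask_n = w > t, w < -t
    Wp = w[mask_p].mean() if mask_p.any() else 0.0    # positive scaling factor
    Wn = -w[mask_n].mean() if mask_n.any() else 0.0   # negative scaling factor
    symbols = np.zeros_like(w, dtype=np.int8)         # 2-bit symbols...
    symbols[mask_p] = 1
    symbols[mask_n] = -1
    return symbols, Wp, Wn                            # ...plus two floats per layer

w = np.random.randn(3, 3, 64)                         # a stack of conv kernels
symbols, Wp, Wn = ternarize(w, t=0.7 * np.abs(w).mean())  # assumed threshold rule
w_inference = np.where(symbols == 1, Wp, np.where(symbols == -1, -Wn, 0.0))
print("ternary values used:", {-Wn, 0.0, Wp})
```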
Last idea: the Winograd transformation. This is about how we implement convolutions in deep neural nets. This slide, credited to Julien, a friend from NVIDIA, shows the conventional, direct convolution implementation. Originally, we just take a dot product between the nine elements of the filter and nine elements of the image, and sum over the channels, so for every output we need 9 times C multiplies and adds, where C is the number of channels.

Winograd convolution is another, equivalent method; it is not lossy. It was first proposed in the paper "Fast Algorithms for Convolutional Neural Networks." Instead of sliding the convolution directly, one step at a time, it first transforms the input feature map into another tile, using transform coefficients that are only values like 1, 0.5, and 2, which can be implemented efficiently with shifts, and it also transforms the filter into a four-by-four tile. Then we take an element-wise product and sum it over the channels, so only 16 multiplications happen here, and finally an inverse transform produces four outputs. The transform and inverse transform can be amortized, and their multiplications can be ignored. So to get four outputs, direct convolution needs 9 x C x 4 = 36 x C multiplications, but now we need only 16 x C. That is 2.25x fewer multiplications to compute exactly the same convolution.

And here is the speedup: theoretically 2.25x, and in practice, on VGG I believe, roughly 1.7x to 2x. Pretty significant. Starting with cuDNN 5, cuDNN uses this Winograd convolution algorithm.
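To make the transforms concrete, here is a NumPy sketch of a single F(2x2, 3x3) tile using the standard transform matrices from that paper; a real library also tiles the image, batches over channels, and amortizes the transforms, which this sketch does not.

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices; entries are only 0, +-1, +-0.5.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_tile(d, g):
    """A 2x2 output tile of a 3x3 convolution over a 4x4 input tile."""
    V = B_T @ d @ B_T.T        # transform the 4x4 input tile
    U = G @ g @ G.T            # transform the 3x3 filter to 4x4
    M = U * V                  # only 16 multiplications here
    return A_T @ M @ A_T.T     # inverse transform -> 2x2 outputs

d, g = np.random.randn(4, 4), np.random.randn(3, 3)
# Direct method for comparison: 9 multiplies per output, 36 in total.
direct = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_tile(d, g), direct))  # True: same result, 16 vs 36 muls
```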
Okay, so far we have covered the efficient algorithms for efficient inference: pruning, weight sharing, quantization, Winograd, and binary and ternary weights. Now let's see what the optimal hardware for efficient inference looks like, and what the Google TPU is.

There is a wide range of domain-specific architectures, ASICs, for deep neural networks. Their common goal is to minimize memory access to save power. For example, Eyeriss from MIT uses the row-stationary dataflow to minimize off-chip data access, and DaDianNao from the Chinese Academy of Sciences buffers all the weights in on-chip eDRAM instead of having to go off-chip. The TPU from Google uses 8-bit integers to represent the numbers, and at Stanford I proposed the EIE architecture, which supports inference directly on compressed, sparse neural networks.

This is what the TPU looks like. Rather cleverly, it fits into a disk-drive slot, up to four cards per server. And this is the high-level architecture of the Google TPU. Don't be overwhelmed: the kernel part is this giant matrix multiplication unit, a 256-by-256 array, so in a single cycle it performs 64K multiply-accumulate operations. Running at 700 MHz, counting a multiply and an add as two operations, that's 65,536 x 2 x 0.7 GHz, about 92 teraops per second; it can be that high because these are integer operations. That's about 25x a GPU and more than 100x a CPU.

And notice that the TPU has a really large software-managed on-chip buffer: 24 megabytes. An L3 cache on a CPU is already considered large at 16 megabytes; this is 24 megabytes, which is pretty large. And it is fed by two DDR3 DRAM channels.
That part is a little weak, though, because the memory bandwidth is only 30 gigabytes per second, compared with recent GPUs using HBM at 900 gigabytes per second. DDR4 was released in 2014, so this makes sense: the design dates from that time and used DDR3. But with DDR4, or even high-bandwidth memory, the performance could be boosted further.

This is a comparison of Google's TPU with the CPU and the GPU (a K80 GPU, by the way). The die area is much smaller, about half the size of the CPU and GPU dies, and the power consumption is roughly 75 watts. And look at this number: the peak throughput is much higher than the CPU's and GPU's, about 90 teraops per second, which is pretty high.

So here is the workload; thanks to David for sharing the slide. This is the workload at Google, on which they benchmarked these TPUs. It's a little interesting that convolutional neural nets account for only 5% of the data-center workload. Most of it, about 61%, is multilayer perceptrons, those fully connected layers, maybe for ads, I'm not sure. And about 29% of the data-center workload is long short-term memory, for example speech recognition or machine translation, I suspect.

Remember, we just saw 90 teraops per second of peak. But how many teraops per second are actually achieved? The roofline model is the basic tool for measuring the bottleneck of a computer system: whether you are bottlenecked by arithmetic or by memory bandwidth. It's like a bucket: the lowest stave determines how much water the bucket can hold. The x-axis is the arithmetic intensity, the number of operations per byte fetched from memory, the ratio between computation and memory traffic. The y-axis is the actual attainable performance, and the flat line is the peak performance. When you fetch a single piece of data and then do a lot of operations on it, you are bottlenecked by arithmetic. But when you fetch a lot of data from memory and do only a tiny bit of arithmetic on it, you are in the sloped region, bottlenecked by memory bandwidth: how fast you can fetch from memory determines how much real performance you get. And remember the ratio: at an arithmetic intensity of one, the attainable performance is just the memory bandwidth of your system, and the turning point sits where that bandwidth-limited slope meets the peak.
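Here is a tiny sketch of the roofline formula; the TPU-like peak and bandwidth numbers are the ones quoted in this talk, and the intensities are made up to show both regimes.

```python
import numpy as np

def attainable_ops(intensity, peak_ops, mem_bw):
    """Roofline: performance is capped by either compute or memory bandwidth.

    intensity: arithmetic intensity in ops per byte fetched from memory.
    peak_ops:  peak throughput in ops/s; mem_bw: memory bandwidth in bytes/s.
    """
    return np.minimum(peak_ops, mem_bw * intensity)

peak, bw = 92e12, 30e9                  # ~92 Tops/s peak, ~30 GB/s DDR3
for intensity in (10, 100, 1000, 3067):  # ops per byte
    print(intensity, f"{attainable_ops(intensity, peak, bw) / 1e12:.1f} Tops/s")
# Below peak/bw ~= 3067 ops/byte the machine is memory-bound, which is why the
# low-reuse fully connected and LSTM workloads sit far under the 92 Tops/s roof.
```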
So let's see what life is like for the TPU. The TPU's peak performance is really high, about 90 teraops per second, and the convolutional nets pretty much saturate it. But there are a lot of neural networks with a utilization below 10%, meaning the nominal 90 teraops per second comes out to about 3 to 12 teraops per second in the real case. Why is that? The reason is that in order to guarantee real-time responses, so that users don't wait too long, you cannot batch many users' images or speech data together. As a result, the fully connected layers have very little reuse, so they are bottlenecked by the memory bandwidth. Among the convolutional nets, this blue one, CNN0, which achieves 86 teraops per second, has the highest ratio of operations to bytes fetched, more than 2,000, while for the multilayer perceptrons and the long short-term memories the ratio is pretty low.

This figure compares the rooflines: this is the TPU, this one the CPU, and this the GPU. Here is the peak memory bandwidth, which you can read off at an intensity ratio of one, and the TPU's roofline sits highest. And here is where these neural networks lie on the curves; the asterisks are the TPU's workloads. They are still higher than the other dots, but if you're not comfortable with this log-scale figure, this is what it looks like as a linear roofline: pretty much everything disappears except the TPU results. Still, all these points, although higher than the CPU's and GPU's, are way below the theoretical peak operations per second.

As I mentioned before, it is really bottlenecked by the low-latency requirement, which prevents a large batch size; that's why the operations per byte are low. And how do you solve this problem? You want a smaller memory footprint, so that the memory bandwidth requirement goes down. One solution is to compress the model, and the challenge is: how do we build hardware that can run inference directly on the compressed model?

So I'm going to introduce my design, EIE, the Efficient Inference Engine, which works directly on the sparse, compressed model to save memory bandwidth. The rules of thumb, as we mentioned before: first, exploit sparsity, since anything times zero is zero, so don't store it and don't compute on it; second, you don't need full precision, you can approximate. By taking advantage of the sparse weights, we get about a 10x saving in computation and 5x less memory footprint; the 2x difference is due to the index overhead. By taking advantage of the sparse activations, meaning that after ReLU, a zero activation is simply ignored, we save another 3x of computation. And with the weight-sharing mechanism, we can use four bits per weight rather than 32 bits, which is another 8x saving in memory footprint.

This is how the weights are stored logically, a four-by-eight matrix, and this is how they are stored physically: only the non-zero weights are stored. You don't store the zeros, and you save the bandwidth of fetching them. I also use relative indices to further reduce the index storage overhead.
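Here is a NumPy sketch of those two rules of thumb applied to a sparse matrix-vector product; note that EIE itself packs 4-bit relative row indices and 4-bit shared-weight indices, while this sketch uses absolute indices and full-precision values for clarity.

```python
import numpy as np

def compress_columns(W):
    """Keep only the non-zero weights, stored column by column."""
    cols = []
    for j in range(W.shape[1]):
        rows = np.flatnonzero(W[:, j])
        cols.append((rows, W[rows, j]))
    return cols

def sparse_matvec(columns, a, out_dim):
    """y = W @ a, skipping zero activations and zero weights entirely."""
    y = np.zeros(out_dim)
    for j in np.flatnonzero(a):        # anything times zero is zero:
        rows, vals = columns[j]        # don't fetch it, don't compute on it
        y[rows] += vals * a[j]
    return y

W = np.random.randn(8, 8) * (np.random.rand(8, 8) > 0.9)  # ~90% sparse weights
a = np.maximum(np.random.randn(8), 0)                     # ReLU: sparse activations
print(np.allclose(W @ a, sparse_matvec(compress_columns(W), a, out_dim=8)))  # True
```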
In the computation, as this figure shows, we run the multiplication only on the non-zeros. Each non-zero activation is broadcast; if an activation is zero, we skip it, and if a weight is zero, we skip that too. Only when both are non-zero do we do the multiplication, one per cycle. The idea, again, is that anything multiplied by zero is zero.

This part is a little complicated, so I'll go quickly: a lookup table decodes the 4-bit weight index into the 16-bit weight, and the 4-bit relative index passes through an address accumulator to produce the 16-bit absolute address. And this is what the hardware architecture looks like at a high level; feel free to refer to my paper for the details.

Okay, speedup. Using such an efficient hardware architecture together with model compression: this is the original result we saw for the CPU, GPU, and mobile GPU, and now EIE is here, 189 times faster than the CPU and about 13 times faster than the GPU. And this is the energy efficiency, on a log scale: about 24,000x more energy-efficient than a CPU and about 3,000x more than a GPU. That means, for example, that if your battery previously lasted one hour, it could now last 3,000 hours.

And if you say an ASIC is always better than CPUs and GPUs because it's customized hardware: this compares EIE with its peer ASICs, for example DaDianNao and TrueNorth. It has better throughput and better energy efficiency by an order of magnitude compared with the other ASICs, not to mention the CPU, GPU, and FPGAs.

So we have covered half of the journey: pretty much everything for inference. Now we're going to switch gears and talk about training. How do we train neural networks efficiently, and how do we train faster?
So again, we start with the algorithms first, efficient training algorithms, followed by the hardware for efficient training.

For efficient training algorithms, I'm going to mention four topics. The first is parallelization; then mixed-precision training, which was presented just about a month ago at NVIDIA's GTC, so it's fresh knowledge; then model distillation; and finally my work on dense-sparse-dense training, a better regularization technique.

Let's start with parallelization. This figure is one that everyone in the hardware community is very familiar with. As time goes by, what is the trend? The number of transistors keeps increasing, but single-threaded performance has plateaued in recent years, and the frequency has plateaued too: because of the power constraint, frequency scaling stopped. The interesting thing is that the number of cores keeps increasing. So what we really need to do is parallelize: how do we parallelize the problem to take advantage of parallel processing?

Actually, there are a lot of opportunities for parallelism in deep neural networks. For example, we can do data parallelism: feeding two images into the same model and running them at the same time. This doesn't reduce the latency of a single input, but it makes the effective batch size larger: with four machines, your effective batch size becomes four times what it was. It does require coordinated weight updates. For example, in this paper from Google, a parameter server acts as the master, with a number of workers each running on its own slice of the training data; they send their gradients up to the parameter server and individually receive the updated weights back. That's how data parallelism is handled.
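Here is a toy NumPy sketch of that parameter-server pattern; the gradient function is a hypothetical stand-in for backprop, and a real system would run the workers on separate machines rather than in a Python loop.

```python
import numpy as np

def worker_gradient(w, shard):
    """Hypothetical stand-in for one worker's backprop on its data shard."""
    return 0.01 * w + shard.mean()

w_server = np.random.randn(1000)       # the parameter server's master weights
shards = np.random.randn(4, 32, 10)    # the batch, split across 4 workers

for step in range(100):
    # Each worker pulls the current weights and computes a gradient on its shard.
    grads = [worker_gradient(w_server, shard) for shard in shards]
    # The server reduces (averages) the gradients and applies one coordinated
    # update; with 4 workers the effective batch size is 4x a single worker's.
    w_server -= 0.1 * np.mean(grads, axis=0)
```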
765 00:45:21,383 --> 00:45:25,543 For example, there's an image, and you want to run convolution 766 00:45:25,543 --> 00:45:29,293 on this image; that is a six-dimensional for loop. 767 00:45:30,530 --> 00:45:35,271 What you can do is cut the input image into 768 00:45:35,271 --> 00:45:39,482 two-by-two blocks so that each thread, or each processor, 769 00:45:39,482 --> 00:45:42,619 handles one fourth of the image, 770 00:45:42,619 --> 00:45:45,580 although there's a small halo in between that you 771 00:45:45,580 --> 00:45:47,330 have to take care of (see the sketch after this passage). 772 00:45:48,260 --> 00:45:50,860 And also, you can parallelize by the 773 00:45:50,860 --> 00:45:53,193 output or input feature maps. 774 00:45:54,730 --> 00:45:56,911 And for those fully connected layers, 775 00:45:56,911 --> 00:45:58,500 how do we parallelize the model? 776 00:45:58,500 --> 00:45:59,442 It's even simpler. 777 00:45:59,442 --> 00:46:02,420 You can cut the model in half 778 00:46:02,420 --> 00:46:05,337 and hand it to different threads. 779 00:46:06,551 --> 00:46:07,991 And the third idea: you can even do 780 00:46:07,991 --> 00:46:09,378 hyper-parameter parallelism. 781 00:46:09,378 --> 00:46:11,762 For example, you can tune your learning rate and your 782 00:46:11,762 --> 00:46:14,402 weight decay on different machines; 783 00:46:14,402 --> 00:46:16,400 that's coarse-grained parallelism. 784 00:46:16,400 --> 00:46:20,780 So there are so many alternatives you have to tune. 785 00:46:20,780 --> 00:46:23,631 A small summary of parallelism. 786 00:46:23,631 --> 00:46:27,031 There is lots of parallelism in deep neural networks. 787 00:46:27,031 --> 00:46:30,271 For example, with data parallelism, you can run multiple 788 00:46:30,271 --> 00:46:34,820 training images, but you cannot have an unlimited number 789 00:46:34,820 --> 00:46:38,940 of processors, because you are limited by the batch size. 790 00:46:38,940 --> 00:46:42,068 If it's too large, stochastic gradient descent 791 00:46:42,068 --> 00:46:44,438 becomes gradient descent, and that's not good. 792 00:46:44,438 --> 00:46:47,277 You can also use model parallelism: 793 00:46:47,277 --> 00:46:50,466 split the model, either by cutting the image or by 794 00:46:50,466 --> 00:46:53,133 cutting the convolution weights, 795 00:46:58,598 --> 00:47:01,223 either cutting the image or cutting 796 00:47:01,223 --> 00:47:03,940 the fully connected layers. 797 00:47:03,940 --> 00:47:08,319 So it's very easy to get 16 to 64 GPUs training one model 798 00:47:08,319 --> 00:47:10,490 in parallel with very good speedup, 799 00:47:10,490 --> 00:47:12,323 almost linear speedup. 800 00:47:13,810 --> 00:47:17,988 Okay, the next interesting thing: mixed precision with 801 00:47:17,988 --> 00:47:19,071 FP16 and FP32. 802 00:47:21,319 --> 00:47:23,370 So remember, in the beginning of this lecture, 803 00:47:23,370 --> 00:47:28,207 I had a chart showing the energy and area overhead of 804 00:47:28,207 --> 00:47:30,290 16-bit versus 32-bit arithmetic. 805 00:47:31,887 --> 00:47:36,054 Going from 32 bit to 16 bit, you save about 4x the energy 806 00:47:37,890 --> 00:47:39,223 and 4x the area. 807 00:47:40,528 --> 00:47:43,340 So can we train a deep neural network with such low 808 00:47:43,340 --> 00:47:47,831 precision, with 16-bit floating point rather than 32-bit? 809 00:47:47,831 --> 00:47:50,998 It turns out we can do that, partially. 810 00:47:53,498 --> 00:47:58,250 By partially, I mean we need FP32 in some places. 811 00:47:58,250 --> 00:48:01,090 And where are those places?
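Before answering that, here is the sketch of the 2x2 model-parallel split mentioned above: a toy NumPy/SciPy example, assuming a 3x3 kernel and a one-pixel halo, that checks the stitched result against the unsplit convolution. The tile sizes and image are made up for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)
k = np.random.rand(3, 3)
full = convolve2d(img, k, mode="same")   # reference: one worker does it all

h, w, halo = 4, 4, 1                     # quadrant size and halo width
out = np.zeros_like(img)
for r in (0, 1):
    for c in (0, 1):
        r0, c0 = r * h, c * w
        # slice this worker's quadrant plus a halo, clipped at the image border
        tile = img[max(r0 - halo, 0):r0 + h + halo,
                   max(c0 - halo, 0):c0 + w + halo]
        conv = convolve2d(tile, k, mode="same")
        # drop the halo rows/columns before writing the quadrant back
        dr, dc = r0 - max(r0 - halo, 0), c0 - max(c0 - halo, 0)
        out[r0:r0 + h, c0:c0 + w] = conv[dr:dr + h, dc:dc + w]

assert np.allclose(out, full)            # same result as the unsplit convolution
```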
812 00:48:01,090 --> 00:48:05,257 So we can do the multiplication with 16-bit inputs, 813 00:48:07,951 --> 00:48:11,476 and then we have to do the summation 814 00:48:11,476 --> 00:48:13,879 with 32-bit accumulation, 815 00:48:13,879 --> 00:48:18,860 and then convert the result to 32 bit to store the weights. 816 00:48:18,860 --> 00:48:22,777 So that's where the mixed precision comes from. 817 00:48:25,108 --> 00:48:28,140 So for example, we have master weights stored in 818 00:48:28,140 --> 00:48:31,932 floating point 32; we down-convert them to floating 819 00:48:31,932 --> 00:48:36,099 point 16, and then we do the feed-forward with 16-bit 820 00:48:37,612 --> 00:48:42,290 weights and 16-bit activations; we get a 16-bit activation 821 00:48:42,290 --> 00:48:46,522 here at the end. When we are doing back-propagation, 822 00:48:46,522 --> 00:48:50,689 the computation is also done with 16-bit floating point. 823 00:48:52,700 --> 00:48:57,351 Very interestingly, for the weights we get a floating 824 00:48:57,351 --> 00:49:00,851 point 16-bit gradient for the weights. 825 00:49:03,255 --> 00:49:07,422 But when we are doing the update, W minus the learning 826 00:49:09,598 --> 00:49:13,154 rate times the gradient, that operation has 827 00:49:13,154 --> 00:49:14,904 to be done in 32 bit. 828 00:49:17,740 --> 00:49:20,943 That's where the mixed precision is coming from. 829 00:49:20,943 --> 00:49:24,692 And you can see there are two colors: here is 16 bit, 830 00:49:24,692 --> 00:49:26,514 and here is 32 bit. 831 00:49:26,514 --> 00:49:30,181 That's where the mixed precision comes from (see the sketch below). 832 00:49:31,284 --> 00:49:36,212 So does such low precision sacrifice the prediction 833 00:49:36,212 --> 00:49:38,884 accuracy of your model? 834 00:49:38,884 --> 00:49:43,051 So this is a figure from NVIDIA, released just a couple 835 00:49:43,914 --> 00:49:45,747 of weeks ago, actually. 836 00:49:46,652 --> 00:49:49,819 Thanks to Paulius for giving me the slide. 837 00:49:51,431 --> 00:49:55,751 The convergence of floating point 32 versus 838 00:49:55,751 --> 00:49:58,500 the tensor ops, which is basically the mixed 839 00:49:58,500 --> 00:50:00,842 precision training, is actually pretty much 840 00:50:00,842 --> 00:50:02,932 the same. 841 00:50:02,932 --> 00:50:04,762 If you zoom in a little bit, 842 00:50:04,762 --> 00:50:06,690 they are pretty much the same. 843 00:50:06,690 --> 00:50:11,052 And for ResNet, the mixed precision sometimes behaves 844 00:50:11,052 --> 00:50:14,771 a little better than the full-precision weights, 845 00:50:14,771 --> 00:50:17,234 maybe because of the noise. 846 00:50:17,234 --> 00:50:20,582 But in the end, after you train the model (this is 847 00:50:20,582 --> 00:50:24,762 the result for AlexNet, Inception V3, and ResNet-50 848 00:50:24,762 --> 00:50:28,679 with FP32 versus FP16 mixed precision training), 849 00:50:29,881 --> 00:50:32,721 the accuracy is pretty much the same 850 00:50:32,721 --> 00:50:33,962 for these two methods. 851 00:50:33,962 --> 00:50:37,295 A little bit worse, but not by too much. 852 00:50:40,042 --> 00:50:43,714 So having talked about mixed precision training, 853 00:50:43,714 --> 00:50:47,881 the next idea is to train with model distillation. 854 00:50:49,703 --> 00:50:52,412 For example, you can have multiple neural networks: 855 00:50:52,412 --> 00:50:55,863 GoogLeNet, VGGNet, and ResNet, for example.
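Here is the sketch referred to above: one mixed-precision training step on a toy linear model, in hypothetical NumPy. FP32 master weights, FP16 forward and backward passes, FP32 update. (The published recipe also adds loss scaling to keep small FP16 gradients from underflowing, which this sketch omits.)

```python
import numpy as np

rng = np.random.default_rng(0)
w_master = rng.normal(size=(4, 2)).astype(np.float32)  # FP32 master weights
x = rng.normal(size=(8, 4)).astype(np.float16)
y = rng.normal(size=(8, 2)).astype(np.float16)
lr = np.float32(0.01)

for step in range(100):
    w16 = w_master.astype(np.float16)   # down-convert the weights to FP16
    out = x @ w16                       # FP16 forward pass
    grad = x.T @ (out - y)              # FP16 backward pass: the weight gradient
    # the update W <- W - lr * grad is done in FP32 against the master copy:
    w_master -= lr * grad.astype(np.float32)
```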
856 00:50:55,863 --> 00:51:00,030 And the question is, can we take advantage of these 857 00:51:00,943 --> 00:51:02,092 different models? 858 00:51:02,092 --> 00:51:05,132 Of course we can do a model ensemble, but can we utilize them 859 00:51:05,132 --> 00:51:09,299 as teachers, to teach a small junior neural network to have 860 00:51:11,201 --> 00:51:15,434 it perform as well as the senior neural networks? 861 00:51:15,434 --> 00:51:17,090 So this is the idea. 862 00:51:17,090 --> 00:51:21,257 You have multiple large, powerful senior neural networks 863 00:51:23,314 --> 00:51:25,202 to teach this student model, 864 00:51:25,202 --> 00:51:28,881 and hopefully it can get better results. 865 00:51:28,881 --> 00:51:32,372 And the idea to do that is, instead of using the 866 00:51:32,372 --> 00:51:37,162 hard label (for example, for car, dog, cat, the probability 867 00:51:37,162 --> 00:51:41,329 for dog is 100%), the output of the geometric 868 00:51:42,383 --> 00:51:46,063 ensemble of those large teacher neural networks 869 00:51:46,063 --> 00:51:50,230 may say the dog has 90% and the cat about 10%, 870 00:51:53,282 --> 00:51:55,492 and the magic happens here. 871 00:51:55,492 --> 00:51:59,071 You want to have a softened label here. 872 00:51:59,071 --> 00:52:03,071 For example, the dog is 30% and the cat is 20%. 873 00:52:03,071 --> 00:52:05,471 Still, the dog is higher than the cat, 874 00:52:05,471 --> 00:52:09,260 so the prediction is still correct, but it uses 875 00:52:09,260 --> 00:52:13,427 this soft label to train the student neural network 876 00:52:15,431 --> 00:52:19,460 rather than using the hard label to train 877 00:52:19,460 --> 00:52:21,991 the student neural network. 878 00:52:21,991 --> 00:52:26,740 And mathematically, you control how much you soften 879 00:52:26,740 --> 00:52:30,482 it with a temperature in the softmax, 880 00:52:30,482 --> 00:52:33,149 controlled by this temperature (see the sketch below). 881 00:52:34,322 --> 00:52:36,751 And the result is that, starting with a trained model 882 00:52:36,751 --> 00:52:40,918 that classifies 58.9% of the test frames correctly, 883 00:52:43,099 --> 00:52:46,099 the new model converges to 57%, 884 00:52:47,340 --> 00:52:50,173 trained on only 3% of the data. 885 00:52:52,699 --> 00:52:54,882 So that's the magic of model distillation, 886 00:52:54,882 --> 00:52:56,715 using these soft labels. 887 00:52:59,191 --> 00:53:02,460 And the last idea is from my recent paper, using 888 00:53:02,460 --> 00:53:06,242 better regularization to train deep neural nets. 889 00:53:06,242 --> 00:53:07,908 We have seen these two figures before. 890 00:53:07,908 --> 00:53:09,929 We pruned the neural network, so it has a smaller number 891 00:53:09,929 --> 00:53:12,300 of weights but the same accuracy. 892 00:53:12,300 --> 00:53:15,439 Now what I did is recover and retrain those 893 00:53:15,439 --> 00:53:18,271 weights, shown in red, and train everything 894 00:53:18,271 --> 00:53:21,625 together to increase the model capacity, after 895 00:53:21,625 --> 00:53:24,887 it has been trained in a low-dimensional space. 896 00:53:24,887 --> 00:53:27,528 It's like you learn the trunk first and then gradually 897 00:53:27,528 --> 00:53:31,071 add the leaves and learn everything together. 898 00:53:31,071 --> 00:53:35,238 It turns out that on ImageNet this gives about 1% to 899 00:53:37,471 --> 00:53:41,020 4% absolute improvement in accuracy.
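Here is the temperature sketch referred to above, before continuing with Dense-Sparse-Dense: a minimal NumPy example of softening teacher logits. The three-class logit values are made up for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    # dividing the logits by a temperature T > 1 spreads probability
    # mass onto the "wrong" classes before normalizing
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

teacher_logits = np.array([9.0, 7.0, 1.0])   # classes: dog, cat, car

print(softmax(teacher_logits, T=1))   # hard-ish: dog ~0.88, cat ~0.12
print(softmax(teacher_logits, T=4))   # softened: dog ~0.57, cat ~0.35;
                                      # the dog is still highest
```

The student is then trained against these softened targets, usually mixed with the true hard labels.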
900 00:53:41,020 --> 00:53:44,998 And it's also general-purpose: it works on long short-term memory 901 00:53:44,998 --> 00:53:49,330 and also recurrent neural nets, in collaboration with Baidu. 902 00:53:49,330 --> 00:53:52,610 So I also open-sourced these specially trained models 903 00:53:52,610 --> 00:53:56,460 in the DSD Model Zoo, where all of 904 00:53:56,460 --> 00:54:00,490 these trained models are available: GoogLeNet, VGG, ResNet, SqueezeNet, 905 00:54:00,490 --> 00:54:01,969 and also AlexNet. 906 00:54:01,969 --> 00:54:05,099 So if you are interested, feel free to check out this 907 00:54:05,099 --> 00:54:09,182 Model Zoo and compare it with the Caffe Model Zoo. 908 00:54:11,010 --> 00:54:14,998 Here are some examples of how dense-sparse-dense training helps 909 00:54:14,998 --> 00:54:16,581 with image captioning. 910 00:54:17,878 --> 00:54:21,396 For example, this is a very challenging figure. 911 00:54:21,396 --> 00:54:24,087 The original NeuralTalk baseline says a boy in 912 00:54:24,087 --> 00:54:27,318 a red shirt is climbing a rock wall. 913 00:54:27,318 --> 00:54:29,179 And the sparse model says a young girl is jumping 914 00:54:29,179 --> 00:54:31,849 off a tree, probably mistaking the hair for either 915 00:54:31,849 --> 00:54:33,729 the rock or the tree. 916 00:54:33,729 --> 00:54:36,278 But the dense-sparse-dense training, by using this kind of 917 00:54:36,278 --> 00:54:39,100 regularization in a low-dimensional space, says 918 00:54:39,100 --> 00:54:42,958 a young girl in a pink shirt is swinging on a swing. 919 00:54:42,958 --> 00:54:47,070 And there are a lot more examples; due to the limit of time, 920 00:54:47,070 --> 00:54:49,129 I will not go over them one by one. 921 00:54:49,129 --> 00:54:51,150 For example, "a group of people are standing in front 922 00:54:51,150 --> 00:54:53,118 of a building," but there's no building. 923 00:54:53,118 --> 00:54:55,630 "A group of people are walking in the park." 924 00:54:55,630 --> 00:54:58,550 Feel free to check out the paper and see more interesting 925 00:54:58,550 --> 00:54:59,383 results. 926 00:55:01,420 --> 00:55:05,587 Okay, finally, we come to hardware for efficient training. 927 00:55:06,478 --> 00:55:08,929 How do we take advantage of the algorithms 928 00:55:08,929 --> 00:55:10,089 we just mentioned, 929 00:55:10,089 --> 00:55:14,060 for example, parallelism and mixed precision? How is 930 00:55:14,060 --> 00:55:16,630 the hardware designed to actually 931 00:55:16,630 --> 00:55:19,297 take advantage of such features? 932 00:55:21,958 --> 00:55:26,041 First, GPUs. This is the Nvidia Pascal GPU, GP100, 933 00:55:28,950 --> 00:55:31,367 which was released last year. 934 00:55:32,289 --> 00:55:35,789 It supports up to 20 teraflops of FP16. 935 00:55:38,048 --> 00:55:40,849 It has 16 gigabytes of high-bandwidth memory, 936 00:55:40,849 --> 00:55:42,932 at 750 gigabytes per second. 937 00:55:46,060 --> 00:55:49,430 So remember, computation and memory bandwidth are 938 00:55:49,430 --> 00:55:53,350 the two factors that determine your overall performance. 939 00:55:53,350 --> 00:55:57,041 Whichever is lower, performance will suffer (see the roofline sketch below). 940 00:55:57,041 --> 00:56:01,124 So this is a really high bandwidth, 750 gigabytes per second, 941 00:56:02,209 --> 00:56:06,376 compared with DDR3 at just 10 or 30 gigabytes per second. 942 00:56:08,189 --> 00:56:10,022 It consumes 300 watts, 943 00:56:14,147 --> 00:56:17,278 it's built in a 16-nanometer process, 944 00:56:17,278 --> 00:56:20,945 and it has a 160-gigabyte-per-second NVLink.
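The roofline sketch referred to above: a back-of-the-envelope Python calculation of how whichever of compute and memory bandwidth is lower caps your attainable performance. The GP100 peaks are the numbers quoted in the lecture; the arithmetic-intensity values are made up for illustration.

```python
PEAK_FLOPS = 20e12    # 20 TFLOPS of FP16 (GP100, as quoted above)
PEAK_BW    = 750e9    # 750 GB/s of HBM bandwidth

def attainable_flops(intensity_flops_per_byte):
    # classic roofline: min(compute roof, bandwidth * arithmetic intensity)
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

for ai in (1, 10, 26.7, 100):   # FLOPs performed per byte moved from memory
    print(f"AI={ai:6.1f} -> {attainable_flops(ai) / 1e12:5.1f} TFLOPS")
# below ~27 FLOPs/byte this GPU is memory-bound; above that, compute-bound
```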
945 00:56:22,248 --> 00:56:25,048 So remember, we have computation, we have memory, 946 00:56:25,048 --> 00:56:28,307 and the third thing is communication. 947 00:56:28,307 --> 00:56:31,547 All three factors have to be balanced in order to 948 00:56:31,547 --> 00:56:33,797 achieve good performance. 949 00:56:35,088 --> 00:56:39,171 So this is very powerful, but even more exciting, 950 00:56:40,558 --> 00:56:44,739 just about a month ago, Jensen revealed the newest 951 00:56:44,739 --> 00:56:48,077 architecture, called the Volta GPU. 952 00:56:48,077 --> 00:56:50,877 And let's see what is inside the Volta GPU. 953 00:56:50,877 --> 00:56:55,044 Just released less than a month ago, it has 15 954 00:56:57,568 --> 00:57:01,651 FP32 teraflops, and what is new here is the 120 955 00:57:03,950 --> 00:57:08,128 Tensor TFLOPS, specifically designed for deep learning. 956 00:57:08,128 --> 00:57:11,207 And we'll cover later what the Tensor Core is, 957 00:57:11,207 --> 00:57:13,957 and where this 120 is coming from. 958 00:57:16,368 --> 00:57:19,699 And rather than 750 gigabytes per second, this 959 00:57:19,699 --> 00:57:24,499 year, with HBM2, they are using 900 gigabytes per second of 960 00:57:24,499 --> 00:57:25,678 memory bandwidth. 961 00:57:25,678 --> 00:57:27,190 Very exciting. 962 00:57:27,190 --> 00:57:32,139 And the 12-nanometer process has a die size of more than 800 963 00:57:32,139 --> 00:57:33,248 square millimeters. 964 00:57:33,248 --> 00:57:37,310 A really large chip, supported by a 300-gigabyte-per- 965 00:57:37,310 --> 00:57:38,477 second NVLink. 966 00:57:40,931 --> 00:57:44,880 So what's new in Volta? The most interesting thing for us, 967 00:57:44,880 --> 00:57:49,251 for deep learning, is this thing called the Tensor Core. 968 00:57:49,251 --> 00:57:51,629 So what is a Tensor Core? 969 00:57:51,629 --> 00:57:56,200 A Tensor Core is actually an instruction that can 970 00:57:56,200 --> 00:58:00,987 do a four-by-four matrix times a four-by-four matrix: 971 00:58:00,987 --> 00:58:05,429 a fused FMA (FMA stands for Fused Multiply-Add), 972 00:58:05,429 --> 00:58:08,491 as a mixed-precision operation, 973 00:58:08,491 --> 00:58:11,074 in just one single clock cycle. 974 00:58:12,939 --> 00:58:15,698 So let's unpack for a little bit what this means. 975 00:58:15,698 --> 00:58:19,865 Mixed precision is exactly as we mentioned in the last 976 00:58:20,699 --> 00:58:24,866 section: we use FP16 for the multiplication, 977 00:58:26,430 --> 00:58:30,430 but for the accumulation, we do it with FP32. 978 00:58:31,928 --> 00:58:35,870 That's where the mixed precision comes from. 979 00:58:35,870 --> 00:58:38,657 So let's see how many operations: if it's four 980 00:58:38,657 --> 00:58:43,030 by four by four, that's 64 multiplications, all 981 00:58:43,030 --> 00:58:45,000 in one single cycle. 982 00:58:45,000 --> 00:58:48,920 That's a 12x increase in throughput for the Volta 983 00:58:48,920 --> 00:58:53,087 compared with the Pascal, which was released just last year (a sketch of this operation follows below). 984 00:58:55,099 --> 00:58:59,590 So this is the result for matrix multiplication at 985 00:58:59,590 --> 00:59:01,288 different sizes. 986 00:59:01,288 --> 00:59:05,455 The speedup of Volta over Pascal is roughly 3x 987 00:59:08,928 --> 00:59:11,845 for these matrix multiplications. 988 00:59:13,368 --> 00:59:16,790 What we care about more is not only matrix multiplication 989 00:59:16,790 --> 00:59:19,958 but actually running the deep neural nets.
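Here is the promised sketch of that operation: what a single D = A x B + C Tensor Core instruction computes on 4x4 tiles, mimicked functionally in NumPy by upcasting the FP16 inputs and accumulating in FP32. This mirrors the arithmetic only, not the circuit.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)).astype(np.float16)   # FP16 inputs
B = rng.normal(size=(4, 4)).astype(np.float16)
C = rng.normal(size=(4, 4)).astype(np.float32)   # FP32 accumulator

# products of FP16 inputs, accumulated in FP32 (emulated by upcasting)
D = A.astype(np.float32) @ B.astype(np.float32) + C

# 4 x 4 x 4 = 64 multiply-accumulates; the hardware does all of them
# in a single clock cycle
print(D.dtype, 4 * 4 * 4, "fused multiply-adds")
```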
990 00:59:19,958 --> 00:59:23,048 So, both for training and for inference. 991 00:59:23,048 --> 00:59:26,630 And for training on ResNet-50, by taking advantage 992 00:59:26,630 --> 00:59:29,998 of the Tensor Cores in the V100, 993 00:59:29,998 --> 00:59:33,581 it is 2.4x faster than the P100 using FP32. 994 00:59:38,887 --> 00:59:43,054 On the right-hand side, it compares the inference 995 00:59:43,899 --> 00:59:48,066 speedup, given a 7-millisecond latency requirement: 996 00:59:50,138 --> 00:59:53,910 what is the number of images per second it can process? 997 00:59:53,910 --> 00:59:56,459 That is a measurement of throughput. 998 00:59:56,459 --> 01:00:00,292 Again, the V100 over the P100, by taking advantage 999 01:00:03,796 --> 01:00:07,796 of the Tensor Cores, is 3.7x faster. 1000 01:00:13,887 --> 01:00:18,745 So this figure gives a rough idea of what a Tensor Core is, 1001 01:00:18,745 --> 01:00:22,287 what an integer unit is, and what a floating-point unit is. 1002 01:00:22,287 --> 01:00:23,954 So this whole figure 1003 01:00:27,705 --> 01:00:28,872 is a single SM, 1004 01:00:33,065 --> 01:00:35,004 a streaming multiprocessor. 1005 01:00:35,004 --> 01:00:39,495 The SM is partitioned into four processing blocks: 1006 01:00:39,495 --> 01:00:41,763 one, two, three, four, right? 1007 01:00:41,763 --> 01:00:45,846 And in each block there are eight FP64 cores 1008 01:00:48,105 --> 01:00:52,105 and 16 FP32 and 16 INT32 units. 1009 01:00:55,751 --> 01:01:00,353 And then there are two of the new mixed-precision 1010 01:01:00,353 --> 01:01:04,520 Tensor Cores, specifically designed for deep learning. 1011 01:01:07,641 --> 01:01:10,684 And there are also the warp scheduler, the dispatch unit, 1012 01:01:10,684 --> 01:01:13,513 and the register file, as before. 1013 01:01:13,513 --> 01:01:17,596 So what is new here is the Tensor Core unit (the back-of-the-envelope arithmetic below shows where the 120 Tensor TFLOPS figure comes from). 1014 01:01:18,935 --> 01:01:23,102 So here is a figure comparing the recent generations of 1015 01:01:25,722 --> 01:01:27,639 Nvidia GPUs, from Kepler 1016 01:01:29,164 --> 01:01:31,664 to Maxwell to Pascal to Volta. 1017 01:01:34,722 --> 01:01:37,425 We can see everything keeps improving. 1018 01:01:37,425 --> 01:01:40,733 For example, the boost clock has increased from 1019 01:01:40,733 --> 01:01:42,816 about 800 MHz to 1.4 GHz. 1020 01:01:46,563 --> 01:01:50,730 And starting from the Volta generation, there are 1021 01:01:52,855 --> 01:01:57,022 the Tensor Core units, which never existed before. 1022 01:01:59,241 --> 01:02:01,158 And up to Maxwell, 1023 01:02:02,364 --> 01:02:04,781 the GPUs were using GDDR5, 1024 01:02:07,924 --> 01:02:10,662 and from the Pascal GPU onward, 1025 01:02:10,662 --> 01:02:12,993 HBM came into place, 1026 01:02:12,993 --> 01:02:14,593 the high-bandwidth memory: 1027 01:02:14,593 --> 01:02:17,093 750 gigabytes per second here, 1028 01:02:18,543 --> 01:02:22,804 900 gigabytes per second here, compared with DDR3 at 1029 01:02:22,804 --> 01:02:24,804 30 gigabytes per second. 1030 01:02:27,364 --> 01:02:31,531 And the memory size actually didn't increase by much, 1031 01:02:34,204 --> 01:02:36,593 and the power consumption 1032 01:02:36,593 --> 01:02:38,783 also remained roughly the same. 1033 01:02:38,783 --> 01:02:41,844 But given the increase in computation you can fit 1034 01:02:41,844 --> 01:02:46,712 within a fixed power envelope, that's still an exciting thing.
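And the promised back-of-the-envelope arithmetic. The per-core figures follow the 4x4x4 fused multiply-add described above; the SM count of 80 and the roughly 1.46 GHz clock are outside assumptions about the V100, not numbers given in the lecture.

```python
sms             = 80            # streaming multiprocessors (assumed for V100)
tensor_cores    = sms * 4 * 2   # 4 blocks per SM, 2 Tensor Cores per block
fma_per_cycle   = 4 * 4 * 4     # 64 fused multiply-adds per core per cycle
flops_per_cycle = tensor_cores * fma_per_cycle * 2   # 1 FMA = 2 FLOPs
clock_hz        = 1.46e9        # assumed boost clock

print(flops_per_cycle * clock_hz / 1e12, "TFLOPS")   # ~120 Tensor TFLOPS
```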
1035 01:02:46,712 --> 01:02:49,433 And the manufacturing process is actually improving, from 1036 01:02:49,433 --> 01:02:53,600 28 nanometers to 16 nanometers, all the way to 12 nanometers. 1037 01:02:55,295 --> 01:02:58,033 And the chip area is also increasing, to 1038 01:02:58,033 --> 01:03:01,616 800 square millimeters; that's really huge. 1039 01:03:03,084 --> 01:03:07,513 So, you may be interested in the comparison of the GPU 1040 01:03:07,513 --> 01:03:09,663 with the TPU, right? 1041 01:03:09,663 --> 01:03:12,463 How do they compare with each other? 1042 01:03:12,463 --> 01:03:15,023 So, from the original TPU paper: 1043 01:03:15,023 --> 01:03:18,797 the TPU was actually designed roughly in 2015, 1044 01:03:18,797 --> 01:03:22,464 and this is a comparison with the Pascal P40 GPU, 1045 01:03:23,673 --> 01:03:25,090 released in 2016. 1046 01:03:27,815 --> 01:03:30,924 So for the TPU, the power consumption is lower, 1047 01:03:30,924 --> 01:03:34,273 and it has a larger on-chip memory of 24 megabytes, 1048 01:03:34,273 --> 01:03:38,015 a really large on-chip SRAM, managed by software. 1049 01:03:38,015 --> 01:03:42,593 And both of them support INT8 operations, 1050 01:03:42,593 --> 01:03:46,760 while for inferences per second given a 10-millisecond latency 1051 01:03:47,764 --> 01:03:50,484 budget, the TPU is at 1X 1052 01:03:50,484 --> 01:03:52,651 and the P40 is at about 2X. 1053 01:03:57,975 --> 01:03:59,558 So, just last week, 1054 01:04:01,682 --> 01:04:03,655 at Google I/O, 1055 01:04:03,655 --> 01:04:06,421 a new nuclear bomb landed on the Earth. 1056 01:04:06,421 --> 01:04:09,251 That is the Google Cloud TPU. 1057 01:04:09,251 --> 01:04:13,203 So now the TPU not only supports inference 1058 01:04:13,203 --> 01:04:15,353 but also supports training. 1059 01:04:15,353 --> 01:04:18,622 There is very limited information we can get 1060 01:04:18,622 --> 01:04:20,873 beyond this Google blog post. 1061 01:04:20,873 --> 01:04:24,790 Their Cloud TPU delivers up to 180 teraflops 1062 01:04:28,713 --> 01:04:32,130 to train and run machine learning models. 1063 01:04:33,422 --> 01:04:36,820 And this is multiple Cloud TPUs 1064 01:04:36,820 --> 01:04:38,903 making up a TPU pod, 1065 01:04:40,110 --> 01:04:44,963 which is built with 64 second-generation TPUs 1066 01:04:44,963 --> 01:04:48,542 and delivers up to 11.5 petaflops 1067 01:04:48,542 --> 01:04:50,873 of machine learning acceleration. 1068 01:04:50,873 --> 01:04:53,862 In the Google blog post, they mentioned that 1069 01:04:53,862 --> 01:04:56,420 one of their large-scale translation models 1070 01:04:56,420 --> 01:05:00,881 used to take a full day to train 1071 01:05:00,881 --> 01:05:05,048 on 32 of the best commercially available GPUs, probably P40 1072 01:05:06,731 --> 01:05:07,981 or P100, maybe. 1073 01:05:08,902 --> 01:05:11,380 And now it trains to the same accuracy 1074 01:05:11,380 --> 01:05:15,547 within just one afternoon, using just 1/8 of a TPU pod, 1075 01:05:17,523 --> 01:05:19,606 which is pretty exciting. 1076 01:05:22,611 --> 01:05:25,273 Okay, so, as a little wrap-up.
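Before the wrap-up, a quick sanity check on those pod numbers, as a back-of-the-envelope calculation (the per-device and pod figures are the ones quoted from the Google announcement; note they are only consistent in petaflops, not teraflops):

```python
per_device_tflops = 180   # one second-generation Cloud TPU, as quoted
devices_per_pod   = 64
pod_petaflops = per_device_tflops * devices_per_pod / 1000
print(pod_petaflops, "petaflops")   # 11.52, i.e. the ~11.5 PFLOPS quoted
```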
1077 01:05:25,273 --> 01:05:27,662 We covered a lot of stuff. We mentioned 1078 01:05:27,662 --> 01:05:30,763 the two-by-two design space of algorithm and hardware, 1079 01:05:30,763 --> 01:05:33,993 inference and training. We covered the algorithms for 1080 01:05:33,993 --> 01:05:36,982 inference: for example, pruning and quantization, 1081 01:05:36,982 --> 01:05:40,251 Winograd convolution, binary and ternary weights, 1082 01:05:40,251 --> 01:05:42,174 and weight sharing, for example. 1083 01:05:42,174 --> 01:05:44,603 And then the hardware for efficient inference: 1084 01:05:44,603 --> 01:05:46,353 for example, the TPU, 1085 01:05:48,665 --> 01:05:52,523 which takes advantage of INT8, 8-bit integers, 1086 01:05:52,523 --> 01:05:56,464 and also my design, the EIE accelerator, which takes advantage 1087 01:05:56,464 --> 01:05:59,951 of sparsity: anything multiplied by zero is zero, 1088 01:05:59,951 --> 01:06:03,201 so don't store it, don't compute on it. 1089 01:06:04,260 --> 01:06:07,131 And also the efficient algorithms for training: for example, 1090 01:06:07,131 --> 01:06:11,312 how we do parallelization, and the most recent research on 1091 01:06:11,312 --> 01:06:14,901 mixed precision training, taking advantage 1092 01:06:14,901 --> 01:06:18,151 of FP16 rather than FP32 to do training, 1093 01:06:19,131 --> 01:06:22,131 which is a four-times saving in energy 1094 01:06:22,131 --> 01:06:23,939 and a four-times saving in area, 1095 01:06:23,939 --> 01:06:27,731 and which doesn't really sacrifice the accuracy you get from 1096 01:06:27,731 --> 01:06:28,814 the training. 1097 01:06:31,803 --> 01:06:35,352 And also Dense-Sparse-Dense training, using better, 1098 01:06:35,352 --> 01:06:39,519 sparse regularization, and also the teacher-student model: 1099 01:06:41,021 --> 01:06:43,741 you have multiple teacher neural networks and a small 1100 01:06:43,741 --> 01:06:46,461 student network, and you can distill the knowledge 1101 01:06:46,461 --> 01:06:51,072 from the teacher neural networks via a temperature. 1102 01:06:51,072 --> 01:06:54,650 And finally, we covered the hardware for efficient training 1103 01:06:54,650 --> 01:06:57,580 and introduced two nuclear bombs. 1104 01:06:57,580 --> 01:07:01,747 One is the Volta GPU; the other is the TPU version two, 1105 01:07:02,590 --> 01:07:06,507 the Cloud TPU, and also the amazing Tensor Cores 1106 01:07:09,184 --> 01:07:12,771 in the newest generation of Nvidia GPUs. 1107 01:07:12,771 --> 01:07:16,632 And we also reviewed the progression of 1108 01:07:16,632 --> 01:07:20,861 the recent Nvidia GPUs, from the Kepler K40 1109 01:07:20,861 --> 01:07:23,461 (that's actually when I started my research, 1110 01:07:23,461 --> 01:07:25,283 what we used in the beginning) 1111 01:07:25,283 --> 01:07:28,033 through the M40, 1112 01:07:29,437 --> 01:07:33,213 then Pascal, and finally the exciting Volta GPU. 1113 01:07:33,213 --> 01:07:37,380 So every year there is a nuclear bomb in the spring. 1114 01:07:40,981 --> 01:07:44,992 Okay, a little look ahead into the future. 1115 01:07:44,992 --> 01:07:47,381 In the city of the future, we can imagine a lot 1116 01:07:47,381 --> 01:07:52,301 of AI applications: smart society, smart care, 1117 01:07:52,301 --> 01:07:56,504 IoT devices, smart retail (for example, Amazon Go), 1118 01:07:56,504 --> 01:07:59,984 and also the smart home; a lot of scenarios.
1119 01:07:59,984 --> 01:08:03,995 And this poses a lot of challenges for hardware design: 1120 01:08:03,995 --> 01:08:07,851 it requires low latency, privacy, mobility, 1121 01:08:07,851 --> 01:08:09,355 and energy efficiency. 1122 01:08:09,355 --> 01:08:12,202 You don't want your battery to drain very quickly. 1123 01:08:12,202 --> 01:08:15,155 So it's both a challenging and a very exciting era 1124 01:08:15,155 --> 01:08:18,904 for the co-design of both the machine learning 1125 01:08:18,904 --> 01:08:20,595 (deep neural network) model architectures 1126 01:08:20,595 --> 01:08:23,283 and the hardware architecture. 1127 01:08:23,283 --> 01:08:26,773 So we have moved from the PC era to the mobile era. 1128 01:08:26,773 --> 01:08:29,973 Now we are in the AI-First era, 1129 01:08:29,973 --> 01:08:32,818 and I hope you are as excited as I am about this kind of 1130 01:08:32,818 --> 01:08:36,485 brain-inspired cognitive computing research. 1131 01:08:37,773 --> 01:08:41,962 Thank you for your attention; I'm glad to take questions. 1132 01:08:41,962 --> 01:08:44,212 [applause] 1133 01:08:50,875 --> 01:08:52,625 We have five minutes. 1134 01:08:54,323 --> 01:08:55,643 Of course. 1135 01:08:55,643 --> 01:08:59,504 - [Student] Can you commercialize the deep architecture? 1136 01:08:59,504 --> 01:09:04,122 - The architecture, yeah, some of the ideas are pretty good. 1137 01:09:04,122 --> 01:09:06,583 I think there's an opportunity. 1138 01:09:06,584 --> 01:09:07,417 Yeah. 1139 01:09:11,841 --> 01:09:12,674 Yeah. 1140 01:09:30,091 --> 01:09:34,258 The question is, what can we do to make the hardware better? 1141 01:09:46,997 --> 01:09:48,979 Oh, right, the question is about 1142 01:09:48,979 --> 01:09:51,917 the challenges and opportunities for those small 1143 01:09:51,917 --> 01:09:54,699 embedded devices running deep neural networks 1144 01:09:54,699 --> 01:09:57,006 or AI algorithms in general. 1145 01:09:57,006 --> 01:10:00,673 Yeah, so those are the algorithms I discussed 1146 01:10:02,197 --> 01:10:04,947 in the beginning, about inference. 1147 01:10:06,309 --> 01:10:07,142 Here. 1148 01:10:08,579 --> 01:10:12,448 These are the techniques that can enable such 1149 01:10:12,448 --> 01:10:15,107 inference or AI running on embedded devices: 1150 01:10:15,107 --> 01:10:18,448 having fewer weights, fewer bits per weight, 1151 01:10:18,448 --> 01:10:20,648 and also quantization and low-rank approximation 1152 01:10:20,648 --> 01:10:24,397 (a smaller matrix, the same accuracy), even going to binary 1153 01:10:24,397 --> 01:10:27,808 or ternary weights, having just two bits 1154 01:10:27,808 --> 01:10:31,288 to do the computation rather than 16 or even 32 bits, 1155 01:10:31,288 --> 01:10:33,745 and also the Winograd transformation. 1156 01:10:33,745 --> 01:10:36,456 Those are the enabling algorithms for those 1157 01:10:36,456 --> 01:10:38,706 low-power embedded devices. 1158 01:10:57,356 --> 01:11:02,189 Okay, the question is: if the weights are binary, software 1159 01:11:02,189 --> 01:11:06,356 developers may not be able to take advantage of it. 1160 01:11:07,509 --> 01:11:11,418 There is a way to take advantage of binary weights. 1161 01:11:11,418 --> 01:11:14,418 So in one register there are 32 bits. 1162 01:11:16,538 --> 01:11:19,827 Now you can think of it as 32-way parallelism. 1163 01:11:19,827 --> 01:11:22,457 Each bit is a single operation. 1164 01:11:22,457 --> 01:11:25,120 So say previously we had 10 ops per second. 1165 01:11:25,120 --> 01:11:27,703 Now you get 320 ops per second.
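A minimal sketch of that 32-way trick, under the usual binary-network convention that weights and activations are +1/-1 values packed one bit per lane (hypothetical Python; the packing scheme is made up for illustration). A 32-element dot product collapses to one XNOR plus a popcount.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=32)   # binary weights
a = rng.choice([-1, 1], size=32)   # binary activations

# pack: map -1 -> 0 and +1 -> 1 into a single 32-bit integer
pack = lambda v: int("".join('1' if b == 1 else '0' for b in v), 2)
W, A = pack(w), pack(a)

xnor = ~(W ^ A) & 0xFFFFFFFF       # 32 "multiplies" in one bitwise operation
matches = bin(xnor).count("1")     # popcount: how many products equal +1
dot = 2 * matches - 32             # (#+1 products) - (#-1 products)

assert dot == int(np.dot(w, a))    # same answer as the element-wise version
```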
1166 01:11:31,000 --> 01:11:33,917 You can do these bitwise operations, 1167 01:11:34,960 --> 01:11:37,287 for example, XOR operations. 1168 01:11:37,287 --> 01:11:39,368 So with one register, 1169 01:11:39,368 --> 01:11:42,285 one operation becomes 32 operations. 1170 01:11:43,608 --> 01:11:47,058 So there is a paper called XNOR-Net; 1171 01:11:47,058 --> 01:11:49,845 they, very amazingly, implemented it 1172 01:11:49,845 --> 01:11:52,637 on the Raspberry Pi using this feature 1173 01:11:52,637 --> 01:11:55,907 to do real-time detection. Very cool stuff. 1174 01:11:55,907 --> 01:11:56,740 Yeah. 1175 01:12:11,779 --> 01:12:15,946 Yeah, so the trade-off is always power, area, 1176 01:12:16,956 --> 01:12:19,819 and performance. In general, all hardware designs 1177 01:12:19,819 --> 01:12:23,298 have to take into account the performance, the power, 1178 01:12:23,298 --> 01:12:24,798 and also the area. 1179 01:12:26,158 --> 01:12:29,387 When machine learning comes in, there's a fourth 1180 01:12:29,387 --> 01:12:32,107 figure of merit, which is accuracy. 1181 01:12:32,107 --> 01:12:34,089 What is the accuracy? 1182 01:12:34,089 --> 01:12:37,019 And there is a fifth one, which is programmability: 1183 01:12:37,019 --> 01:12:39,089 how general is your hardware? 1184 01:12:39,089 --> 01:12:42,089 For example, if Google just wants to use it for AI 1185 01:12:42,089 --> 01:12:45,507 and deep learning, it's totally fine 1186 01:12:45,507 --> 01:12:48,635 to have a very specialized architecture 1187 01:12:48,635 --> 01:12:51,206 just for deep learning, supporting convolution, 1188 01:12:51,206 --> 01:12:54,307 multi-layer perceptrons, and long short-term memory; 1189 01:12:54,307 --> 01:12:58,224 but for GPUs, you also want to have support for 1190 01:13:00,067 --> 01:13:03,734 scientific computing or graphics, AR and VR. 1191 01:13:04,915 --> 01:13:07,998 So that's one difference, first of all. 1192 01:13:10,804 --> 01:13:14,244 And the TPU is basically an ASIC, right? 1193 01:13:14,244 --> 01:13:16,987 It's very fixed-function, but you can still program it 1194 01:13:16,987 --> 01:13:21,587 with coarse instructions; the people at Google 1195 01:13:21,587 --> 01:13:24,755 designed those coarse-granularity instructions. 1196 01:13:24,755 --> 01:13:27,467 For example, one instruction just loads a matrix, 1197 01:13:27,467 --> 01:13:29,795 stores a matrix, does convolutions, 1198 01:13:29,795 --> 01:13:31,507 or does matrix multiplications. 1199 01:13:31,507 --> 01:13:34,377 Those are coarse-grained instructions, 1200 01:13:34,377 --> 01:13:37,710 and they have a software-managed memory, 1201 01:13:38,605 --> 01:13:40,558 also called a scratchpad. 1202 01:13:40,558 --> 01:13:43,885 It's different from a cache, where the hardware decides 1203 01:13:43,885 --> 01:13:47,217 what to evict from the cache; but now, 1204 01:13:47,217 --> 01:13:49,845 since you know the computation pattern, 1205 01:13:49,845 --> 01:13:53,512 there's no need to do out-of-order execution 1206 01:13:54,446 --> 01:13:57,066 or branch prediction, no such things. 1207 01:13:57,066 --> 01:14:00,255 Everything is deterministic, so you can take advantage of 1208 01:14:00,255 --> 01:14:04,422 it and maintain a fully software-managed scratchpad 1209 01:14:05,337 --> 01:14:09,897 to reduce the data movement. And remember, data movement 1210 01:14:09,897 --> 01:14:13,084 is the key to reducing the memory footprint 1211 01:14:13,084 --> 01:14:14,606 and energy consumption (see the tiling sketch below). 1212 01:14:14,606 --> 01:14:15,439 So, yeah.
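A toy sketch of the software-managed-scratchpad idea in Python: because the computation pattern is known ahead of time, the "compiler" stages one tile at a time into a small fast buffer and reuses it heavily, instead of letting a hardware cache guess. The scratchpad size and tiling here are made up for illustration.

```python
import numpy as np

SCRATCHPAD_WORDS = 16 * 16   # pretend this is all the fast on-chip memory we have

def tiled_matmul(A, B, tile=16):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # explicit, pre-scheduled data movement into the "scratchpad"
                a = A[i:i+tile, k:k+tile].copy()
                b = B[k:k+tile, j:j+tile].copy()
                assert a.size <= SCRATCHPAD_WORDS
                C[i:i+tile, j:j+tile] += a @ b   # heavy reuse of the staged tiles
    return C

A = np.random.rand(64, 64); B = np.random.rand(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```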
1213 01:14:26,633 --> 01:14:30,313 The Movidius and Nervana architectures, actually, I'm not quite 1214 01:14:30,313 --> 01:14:33,813 familiar with, and I didn't prepare those slides, so 1215 01:14:34,736 --> 01:14:37,569 I'll comment on that a little bit later. 1216 01:14:52,428 --> 01:14:54,507 Oh, yeah, of course. 1217 01:14:54,507 --> 01:14:57,778 Those can always, and certainly, be applied 1218 01:14:57,778 --> 01:15:00,269 to low-power embedded devices. 1219 01:15:00,269 --> 01:15:03,686 If you're interested, I can show you a... 1220 01:15:04,629 --> 01:15:05,462 Whoops. 1221 01:15:06,971 --> 01:15:08,888 Some examples of, oops. 1222 01:15:10,689 --> 01:15:11,859 Where is that? 1223 01:15:11,859 --> 01:15:15,731 Of my previous projects running deep neural nets. 1224 01:15:15,731 --> 01:15:19,394 For example, on a drone: this is using an Nvidia TK1 1225 01:15:19,394 --> 01:15:23,561 mobile GPU to do real-time tracking and detection. 1226 01:15:26,691 --> 01:15:28,898 This is me playing my nunchaku, 1227 01:15:28,898 --> 01:15:32,898 filmed by a drone doing the detection and tracking. 1228 01:15:34,672 --> 01:15:38,939 And also, this FPGA running a deep neural network. 1229 01:15:38,939 --> 01:15:41,039 It's pretty small, 1230 01:15:41,039 --> 01:15:44,611 about this large, doing face alignment and 1231 01:15:44,611 --> 01:15:48,194 detecting the eyes, the nose, and the mouth 1232 01:15:49,352 --> 01:15:51,602 at a pretty high frame rate, 1233 01:15:53,151 --> 01:15:55,401 consuming only three watts. 1234 01:15:56,918 --> 01:16:00,689 This is a project I did at Facebook, running 1235 01:16:00,689 --> 01:16:03,269 deep neural nets on the mobile phone to do 1236 01:16:03,269 --> 01:16:06,781 image classification; for example, it says it's a laptop, 1237 01:16:06,781 --> 01:16:10,389 or you can feed it an image and it says 1238 01:16:10,389 --> 01:16:14,480 it's a selfie, it has a person and a face, et cetera. 1239 01:16:14,480 --> 01:16:17,621 So there's a lot of opportunity for 1240 01:16:17,621 --> 01:16:21,788 embedded or mobile deployment of deep neural nets. 1241 01:16:30,419 --> 01:16:32,288 No, there is a team doing that, 1242 01:16:32,288 --> 01:16:34,808 but I cannot comment too much, probably. 1243 01:16:34,808 --> 01:16:38,975 There is a team at Google doing that sort of stuff, yeah. 1244 01:16:44,876 --> 01:16:46,208 Okay, thanks, everyone. 1245 01:16:46,208 --> 00:00:00,000 If you have any questions, feel free to drop me an e-mail.